library(fpp3)
## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.3
## ✔ dplyr 1.1.2 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.0 ✔ feasts 0.3.1
## ✔ lubridate 1.9.2 ✔ fable 0.3.3
## ✔ ggplot2 3.4.2 ✔ fabletools 0.3.3
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ purrr 1.0.1 ✔ stringr 1.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(latex2exp)
library(urca)
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(readabs)
## Environment variable 'R_READABS_PATH' is unset. Downloaded files will be saved in a temporary directory.
## You can set 'R_READABS_PATH' at any time. To set it for the rest of this session, use
## Sys.setenv(R_READABS_PATH = <path>)
# Use your student ID as the seed
set.seed(31851096)
myseries <- aus_retail |>
# Remove discontinued series
filter(!(`Series ID` %in% c("A3349561R","A3349883F","A3349499L","A3349902A",
"A3349588R","A3349763L","A3349372C","A3349450X",
"A3349679W","A3349378T","A3349767W","A3349451A"))) |>
# Select a series at random
filter(`Series ID` == sample(`Series ID`,1))
## Discussion of Statistical Features
myseries |> autoplot(Turnover) + ylab('Turnover for South Australian Retail in $millions AUD')
Plotted above is the turnover for the retail sector of South Australia from April 1982 to December 2018. At first glance we can see a general upward trend, which is strong except during 2012 to 2015. There is also a clear seasonal pattern whose magnitude increases over time. The drop each January follows the festive-season and holiday peak in December.
myseries |>
gg_season(Turnover, labels = "both") +
labs(y = "AUD$ (millions)",
title = "Seasonal plot: South Australian Retail sales")
Above is a seasonal plot that layers the years on top of each other. We can clearly see that January of each year drops from a December high, which confirms the holiday-season effect; otherwise there is not much within-year movement apart from slight increases in March and May. We can also see that the lines generally rise from the bottom (earliest years) to the top (latest years), further evidence of the trend in the data. However, some lines intersect, notably in 2008, 2012 and 2013: February 2013 is lower than February 2008, as is February 2009. While we can attribute 2008 and 2009 to the global recession, 2012 and 2013 may have been a period of stagnation in which the retail sector saw little growth.
myseries |> gg_subseries() +
labs(y = "AUD$ (millions)",
title = "Subseries plot: South Australian Retail sales")
## Plot variable not specified, automatically selected `y = Turnover`
The above diagram plots each month (January, February, and so on) as its own separate time series, with the blue line marking the average value for each month. The seasonality is much clearer here: there is a drop from January to February, slight increases in March and May, then a build-up through the months leading up to December, when we get a spike. We can also see a stagnation in all months during the first half of the 1990s, another set of kinks between 2000 and 2005 in every month but December, and kinks in all months around 2010.
Referring to the autoplot of Turnover, we can see that the variation in the seasonal component increases over time. The trend, while increasing, is also not very linear. We will therefore apply a Box-Cox transformation, first trying different values of lambda by hand.
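To make the trials below concrete, here is a minimal sketch of the transformation that `box_cox()` applies for strictly positive data such as turnover (fabletools uses a modified formula for non-positive values, but the two coincide here):

```r
# box_cox(y, lambda): log(y) when lambda = 0, else (y^lambda - 1) / lambda
box_cox_manual <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

box_cox_manual(100, 1)     # 99: lambda = 1 only shifts the data
box_cox_manual(100, 0.5)   # 18: larger values are pulled in
box_cox_manual(100, 0)     # 4.605...: the log transform
```

Smaller values of lambda compress large observations more strongly, which is why moving lambda below 1 tames the growing seasonal variation.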
myseries |>
autoplot(box_cox(Turnover, 2)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed South Australian Retail with $\\lambda$ = 2 ")))
As a starting point for the trial and error, we took lambda = 2. Instead of decreasing the variation, this transformation increases it, so we realise the correct direction is towards values less than 1.
myseries |>
autoplot(box_cox(Turnover, 0.5)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed South Australian Retail with $\\lambda$ = 0.5 ")))
We can see that a lambda of 0.5 is somewhat better than 2, but there is still an increase in the variation of the seasonality over time.
myseries |>
autoplot(box_cox(Turnover, 0.1)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed South Australian Retail with $\\lambda$ = 0.1 ")))
Here we can see that the variation is much smaller than it was with 0.5 as lambda. To find the value of lambda that best stabilises the seasonal variation, we can apply the Guerrero method.
lambda <- myseries |>
features(Turnover, features = guerrero) |>
pull(lambda_guerrero)
myseries_transformed <- myseries |> mutate(Turnover = box_cox(Turnover, lambda))
myseries_transformed |> autoplot() +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed Turnover in South Australia with $\\lambda$ = ",
round(lambda, 3))))
## Plot variable not specified, automatically selected `.vars = Turnover`
Using the Guerrero method, the selected lambda is approximately 0.1 (the exact value is shown in the plot title), only a marginal change from our last trial value.
## Explanation of transformation
myseries |>
features(box_cox(Turnover, lambda), unitroot_kpss)
## # A tibble: 1 × 4
## State Industry kpss_stat kpss_pvalue
## <chr> <chr> <dbl> <dbl>
## 1 South Australia Other retailing n.e.c. 7.27 0.01
Using unitroot_kpss, we can see that the kpss_pvalue is 0.01, which is lower than 0.05. This tells us we can reject the null hypothesis of stationarity and conclude that the transformed but undifferenced data is not stationary.
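For intuition, the KPSS statistic can be sketched in base R: it accumulates the partial sums of the demeaned series and scales them by a long-run variance estimate, so a series that wanders from its mean produces a large statistic. This is a simplified sketch of the idea behind `unitroot_kpss`, not its exact implementation:

```r
kpss_stat <- function(y, l = trunc(4 * (length(y) / 100)^0.25)) {
  n <- length(y)
  e <- y - mean(y)   # residuals from a constant-only (level) regression
  S <- cumsum(e)     # partial sums; these explode if y drifts from its mean
  # Newey-West long-run variance with Bartlett weights
  lrv <- sum(e^2) / n
  for (k in seq_len(l)) {
    lrv <- lrv + 2 * (1 - k / (l + 1)) * sum(e[(k + 1):n] * e[1:(n - k)]) / n
  }
  sum(S^2) / (n^2 * lrv)
}

kpss_stat(1:100)        # trending series: large statistic
kpss_stat(sin(1:100))   # bounded oscillation: small statistic
```

Large statistics correspond to small p-values, which is why the trending turnover series rejects the null of stationarity.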
myseries |>
mutate(turnover_transform = difference(box_cox(Turnover, lambda))) |>
features(turnover_transform, unitroot_kpss)
## # A tibble: 1 × 4
## State Industry kpss_stat kpss_pvalue
## <chr> <chr> <dbl> <dbl>
## 1 South Australia Other retailing n.e.c. 0.0185 0.1
Looking at the differenced data, the kpss_pvalue is 0.1, which is greater than 0.05, so we fail to reject the null hypothesis of stationarity and can treat the differenced data as stationary.
myseries |>
summarise(Turnover = box_cox(Turnover, lambda)) |>
features(Turnover, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
Applying unitroot_ndiffs to the transformed turnover tells us how many first differences are required to make the series stationary: the answer is one.
myseries |>
mutate(box_cox_Turnover = difference((box_cox(Turnover, lambda)))) |>
features(box_cox_Turnover, unitroot_nsdiffs)
## # A tibble: 1 × 3
## State Industry nsdiffs
## <chr> <chr> <int>
## 1 South Australia Other retailing n.e.c. 1
Applying unitroot_nsdiffs to the first-differenced data tells us whether a seasonal difference is also needed. With the output being 1, we know that one seasonal difference is required.
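The effect of combining one seasonal difference and one first difference can be checked on a toy series in base R:

```r
y <- rep(1:12, 4) + 0.5 * (1:48)   # fixed monthly pattern plus linear trend

seas_diff <- diff(y, lag = 12)     # y_t - y_{t-12}: pattern cancels, constant 6 left
both      <- diff(seas_diff)       # one more first difference: constant gone too

all(seas_diff == 6)   # TRUE
all(both == 0)        # TRUE
```

This mirrors the `difference(12) |> difference()` pipeline used on the transformed turnover below.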
## Methodology to create a short-list of appropriate ARIMA models and ETS models
In order to derive a short-list of ARIMA and ETS models, we first inspect the differenced data with its ACF and PACF plots so we can specify an ARIMA model by hand.
myseries |>
gg_tsdisplay(box_cox(Turnover, lambda) |> difference(12) |> difference(),
plot_type = "partial", lag = 60) +
labs(title = "Seasonally and first differenced data", y = "Turnover", x = "Time")
## Warning: Removed 13 rows containing missing values (`geom_line()`).
## Warning: Removed 13 rows containing missing values (`geom_point()`).
From the above, we can see significant non-seasonal spikes out to lag 11 and spikes at the first two seasonal lags in the ACF. Similarly, the PACF shows significant non-seasonal spikes out to lag 11 and spikes out to the fourth seasonal lag. The candidate models that come out are ARIMA(11,1,0)(4,1,0)[12] and ARIMA(0,1,11)(0,1,2)[12].
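The spikes read off the plot are sample autocorrelations; as a sanity check, the quantity the ACF panel plots can be computed by hand and compared against base R's `acf()`:

```r
# r_k = sum_t (x_t - xbar)(x_{t+k} - xbar) / sum_t (x_t - xbar)^2
acf_at <- function(x, k) {
  n <- length(x)
  d <- x - mean(x)
  sum(d[1:(n - k)] * d[(k + 1):n]) / sum(d^2)
}

x <- c(1, 2, 3, 4, 5)
acf_at(x, 1)                 # 0.4
acf(x, plot = FALSE)$acf[2]  # same value from base R's acf()
```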
We will also shortlist additional ARIMA models by letting the ARIMA() function select its own parameters. One automatic model uses the default settings; a second sets stepwise = FALSE, which makes the search slower but more thorough, and likewise sets approximation = FALSE, which again trades speed for thoroughness.
Similarly, we fit an automatic ETS model, leaving it up to the ETS() function to decide the components. We also specify a simple additive model, ETS(A,A,A), and a multiplicative model with a damped additive trend: multiplicative because the data is strictly positive with seasonal variation that grows with the level, and damped because, although the trend is strong, this is the retail industry of a mostly developed economy, so the trend may slow.
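For intuition, the recursion at the heart of every ETS model is exponential smoothing of a component. A minimal sketch of the level update (ETS(A,A,A) adds analogous trend and seasonal updates):

```r
# Level update: l_t = alpha * y_t + (1 - alpha) * l_{t-1}
ses_level <- function(y, alpha, l0) {
  l <- l0
  for (t in seq_along(y)) l <- alpha * y[t] + (1 - alpha) * l
  l
}

ses_level(c(10, 12, 11), alpha = 1, l0 = 0)   # 11: tracks the last observation
ses_level(c(10, 12, 11), alpha = 0, l0 = 5)   # 5: never moves from l0
ses_level(20, alpha = 0.5, l0 = 10)           # 15: halfway in between
```

The smoothing parameters alpha, beta and gamma reported by `report()` below control exactly this trade-off between reacting to new observations and remembering old ones.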
fit<- myseries|>
model(
ETS = ETS(Turnover),
Dampened = ETS(Turnover~error("M") + trend("Ad") + season("M")),
Additive = ETS(Turnover~error("A") + trend("A") + season("A")),
ARIMA_auto = ARIMA(box_cox(Turnover, lambda)),
ARIMA = ARIMA(box_cox(Turnover, lambda),stepwise = FALSE, approximation = FALSE),
kanishk_MA = ARIMA(box_cox(Turnover, lambda) ~ 0 + pdq(0, 1, 11) + PDQ(0,1,2)),
Kanishk_AR = ARIMA(box_cox(Turnover, lambda) ~ 0 + pdq(11,1,0) + PDQ(4,1,0))
)
report(fit) |> arrange(AICc)
## Warning in report.mdl_df(fit): Model reporting is only supported for individual
## models, so a glance will be shown. To see the report for a specific model, use
## `select()` and `filter()` to identify a single model.
## # A tibble: 7 × 13
## State Industry .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Sout… Other r… ARIMA 0.00485 531. -1045. -1045. -1013. NA NA NA
## 2 Sout… Other r… ARIMA… 0.00494 526. -1038. -1038. -1010. NA NA NA
## 3 Sout… Other r… kanis… 0.00486 533. -1037. -1036. -980. NA NA NA
## 4 Sout… Other r… Kanis… 0.00502 528. -1025. -1023. -960. NA NA NA
## 5 Sout… Other r… ETS 0.00261 -1637. 3310. 3312. 3384. 10.8 14.4 0.0393
## 6 Sout… Other r… Dampe… 0.00261 -1637. 3310. 3312. 3384. 10.8 14.4 0.0393
## 7 Sout… Other r… Addit… 9.64 -1834. 3702. 3704. 3772. 9.29 13.4 2.18
## # ℹ 2 more variables: ar_roots <list>, ma_roots <list>
report(fit[[3]][[1]])
## Series: Turnover
## Model: ETS(M,Ad,M)
## Smoothing parameters:
## alpha = 0.5312565
## beta = 0.007262025
## gamma = 0.07245748
## phi = 0.9799999
##
## Initial states:
## l[0] b[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5]
## 13.3046 0.1965592 0.9741105 0.9029897 0.9558791 1.440852 1.033417 0.9943028
## s[-6] s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.9619785 0.96566 0.9393695 0.9088456 0.9757599 0.9468346
##
## sigma^2: 0.0026
##
## AIC AICc BIC
## 3310.116 3311.737 3383.719
Delving deeper, we can see that the model the ETS() function selected on its own, ETS(M,Ad,M), is the same as our multiplicative model with damped trend, which is why their metrics are identical.
The models above have been trained and passed to the report function for their metrics, and we choose the best models by minimising AICc; arrange() sorts them in ascending order. The ARIMA model with stepwise and approximation turned off (FALSE) gives the best result, but only marginally. The MA-style model picked from the residual plots gives a fairly similar AICc, while the AR-style model is a bit worse at -1023.341. All the ARIMA models score better than the ETS models, among which the automatic ETS and the damped model do equally well, both with an AICc of 3311.737. (Note that the ARIMA models are fitted to the Box-Cox transformed series, so their AICc values are not directly comparable with the ETS values, which are computed on the original scale.)
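For reference, the AICc being minimised is AIC plus a small-sample penalty. The sketch below uses illustrative values of the parameter count k and sample size n, not the exact counts fable used:

```r
# AICc = AIC + 2k(k+1)/(n - k - 1), with AIC = -2*loglik + 2k
aicc <- function(loglik, k, n) {
  aic <- -2 * loglik + 2 * k
  aic + 2 * k * (k + 1) / (n - k - 1)
}

# A model with more parameters must improve the likelihood enough
# to overcome the larger penalty (values here are illustrative):
aicc(loglik = 531, k = 9,  n = 441)
aicc(loglik = 533, k = 16, n = 441)
```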
However, as this is fitted on the entire data, it is better to split the series into training and test sets and evaluate out of sample.
train <- myseries |>
filter(yearmonth(Month)<yearmonth("2017 Jan"))
test <- myseries |>
filter(yearmonth(Month)>=yearmonth("2017 Jan"))
Train_fit <- train |>
model(
ETS = ETS(Turnover),
Dampened = ETS(Turnover~error("M") + trend("Ad") + season("M")),
Additive = ETS(Turnover~error("A") + trend("A") + season("A")),
ARIMA_auto = ARIMA(box_cox(Turnover, lambda)),
ARIMA = ARIMA(box_cox(Turnover, lambda),stepwise = FALSE, approximation = FALSE),
kanishk_MA = ARIMA(box_cox(Turnover, lambda) ~ 0 + pdq(0, 1, 11) + PDQ(0,1,2)),
Kanishk_AR = ARIMA(box_cox(Turnover, lambda) ~ 0 + pdq(11,1,0) + PDQ(4,1,0))
)
report(Train_fit) |> arrange(AICc)
## Warning in report.mdl_df(Train_fit): Model reporting is only supported for
## individual models, so a glance will be shown. To see the report for a specific
## model, use `select()` and `filter()` to identify a single model.
## # A tibble: 7 × 13
## State Industry .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 South A… Other r… ARIMA 0.00482 501. -987. -986. -954. NA NA NA
## 2 South A… Other r… ARIMA… 0.00486 500. -984. -983. -952. NA NA NA
## 3 South A… Other r… kanis… 0.00475 504. -980. -979. -924. NA NA NA
## 4 South A… Other r… Kanis… 0.00508 496. -960. -959. -896. NA NA NA
## 5 South A… Other r… ETS 0.00249 -1502. 3039. 3041. 3112. 8.46 11.4 0.0382
## 6 South A… Other r… Dampe… 0.00249 -1502. 3039. 3041. 3112. 8.46 11.4 0.0382
## 7 South A… Other r… Addit… 9.40 -1717. 3468. 3469. 3537. 9.04 13.0 2.09
## # ℹ 2 more variables: ar_roots <list>, ma_roots <list>
Looking at the fit on the training split, the AICc values change slightly; these are not directly comparable with the full-data fits since the sample differs, so the real test will be performance on the held-out test data.
fc <- Train_fit |>
forecast(h = 24) |> filter(Month >= yearmonth("2017 Jan"))
fc |>
autoplot(test) +
labs(title = "Forecast on 2 year test data for different models", y = "Turnover in $AUD (millions)", x = "Time")
The above plot shows the two-year forecasts, with Turnover on the y-axis and time on the x-axis. The models mostly capture the seasonality well. The ETS models perform somewhat worse, while the ARIMA models are very similar to each other. To narrow the focus, we now choose one ARIMA and one ETS model.
For ARIMA, the chosen model is the one with stepwise and approximation set to FALSE, as its AICc was the lowest and the ARIMA forecasts are otherwise fairly similar to one another.
For ETS, the automatic model will be chosen, as it had the best AICc of the ETS models. The automatic and damped models perform identically even in the forecast, and while the additive model does better on the test forecast, this is not reflected in its AICc, so it may have performed well by chance.
## Best ARIMA and ETS
ARIMA <- train |>
model(ARIMA = ARIMA(box_cox(Turnover, lambda), stepwise = FALSE, approximation = FALSE))
report(ARIMA)
## Series: Turnover
## Model: ARIMA(3,0,2)(0,1,1)[12] w/ drift
## Transformation: box_cox(Turnover, lambda)
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sma1 constant
## 0.0862 0.1258 0.6242 0.5213 0.4017 -0.8458 0.0124
## s.e. 0.1043 0.0868 0.0843 0.1286 0.1114 0.0434 0.0012
##
## sigma^2 estimated as 0.004822: log likelihood=501.26
## AIC=-986.51 AICc=-986.15 BIC=-954.48
# dof = 6: the number of estimated AR/MA coefficients (3 AR + 2 MA + 1 seasonal MA), excluding the constant
ARIMA |>
gg_tsresiduals()
In the residual plots above, the residual series looks stationary, with the variation increasing slightly towards the end. Furthermore, there are no significant spikes in the ACF plot. The histogram looks slightly right-skewed, but is close enough to normal.
ARIMA |>
augment() |>
features(.innov, ljung_box, dof = 6, lag = 24)
## # A tibble: 1 × 5
## State Industry .model lb_stat lb_pvalue
## <chr> <chr> <chr> <dbl> <dbl>
## 1 South Australia Other retailing n.e.c. ARIMA 21.0 0.277
The p-value for the Ljung-Box test is 0.277, which is greater than 0.05, so we fail to reject the null and can treat the residuals as white noise.
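The Ljung-Box statistic that `ljung_box` reports can be reproduced by hand and checked against base R's `Box.test()` (the residual series here is a simulated stand-in, not the model's actual innovations):

```r
# Q = n(n+2) * sum_{k=1}^{h} r_k^2 / (n - k),
# compared to a chi-squared distribution with h - dof degrees of freedom
ljung_box_stat <- function(x, h) {
  n <- length(x)
  d <- x - mean(x)
  r <- sapply(1:h, function(k) sum(d[1:(n - k)] * d[(k + 1):n]) / sum(d^2))
  n * (n + 2) * sum(r^2 / (n - (1:h)))
}

set.seed(123)
x <- rnorm(200)                # stand-in for the innovation residuals
q <- ljung_box_stat(x, h = 24)
q
Box.test(x, lag = 24, type = "Ljung-Box")$statistic   # agrees
```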
# Forecast
# Re-use the fitted model rather than re-estimating it
forecast_1 <- ARIMA |>
forecast(h = 24)
forecast_1 |>
head(1) |>
mutate(interval = hilo(Turnover, 95)) |>
pull(interval)
## <hilo[1]>
## [1] [79.92105, 95.28957]95
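For reference, a 95% interval for a normally distributed forecast is the mean plus or minus qnorm(0.975) standard deviations. fable computes this on the Box-Cox transformed scale and back-transforms the endpoints, so the printed interval need not be symmetric; the values below are purely illustrative:

```r
mu    <- 87.6   # hypothetical forecast mean (illustrative only)
sigma <- 3.9    # hypothetical forecast standard deviation
z <- qnorm(0.975)   # ~1.96 for a 95% interval
ci <- c(lower = mu - z * sigma, upper = mu + z * sigma)
ci
```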
### ETS model training fit
ETS <- train |>
model(ETS = ETS(Turnover))
report(ETS)
## Series: Turnover
## Model: ETS(M,Ad,M)
## Smoothing parameters:
## alpha = 0.5859641
## beta = 0.005823885
## gamma = 0.000101264
## phi = 0.98
##
## Initial states:
## l[0] b[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5]
## 13.86604 0.151632 0.9664761 0.9022304 0.9576749 1.422967 1.046054 1.0025
## s[-6] s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.9603617 0.9625781 0.9438544 0.9130962 0.9723232 0.9498838
##
## sigma^2: 0.0025
##
## AIC AICc BIC
## 3039.087 3040.805 3111.682
ETS |>
gg_tsresiduals(lag_max = 24)
ETS |>
augment() |>
features(.innov, ljung_box, lag = 24)
## # A tibble: 1 × 5
## State Industry .model lb_stat lb_pvalue
## <chr> <chr> <chr> <dbl> <dbl>
## 1 South Australia Other retailing n.e.c. ETS 84.3 0.0000000123
While the residuals look like white noise in the time plot, the ACF spikes paint a different picture, suggesting autocorrelation, which is backed up by the Ljung-Box test and its tiny p-value. However, as this is the best ETS model available, for the purposes of this report it will still be considered against the ARIMA model.
## Results Comparison
# Forecast
forecast_2 <- train |>
model(ETS = ETS(Turnover)) |>
forecast(h=24)
forecast_2 |>
autoplot(train) + labs(title = "ETS Forecast", y = "Turnover in $AUD (Millions)")
Above is the forecast plot for the ETS model, showing both the point forecast (the line) and the prediction intervals (the shaded areas).
forecast_1 |>
autoplot(train) + labs(title = "ARIMA Forecast", y = "Turnover in $AUD (Millions)")
Similarly, this is the ARIMA forecast. The forecast appears not to connect smoothly to the training data because the test period starts in January, where there is a large seasonal drop.
forecast_2 |>
head(1) |>
mutate(interval = hilo(Turnover, 95)) |>
pull(interval)
## <hilo[1]>
## [1] [78.2184, 95.17804]95
#Q5
bind_rows(
ARIMA |> forecast(h = 24) |> accuracy(test),
ETS |> forecast(h = 24) |> accuracy(test))|>
select(-ME, -MPE, -ACF1, -MASE, -RMSSE)
## # A tibble: 2 × 7
## .model State Industry .type RMSE MAE MAPE
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 ARIMA South Australia Other retailing n.e.c. Test 12.5 11.8 11.1
## 2 ETS South Australia Other retailing n.e.c. Test 16.7 15.6 14.6
Here we compare the models on metrics such as RMSE, MAE and MAPE, where smaller is better. While ARIMA has an RMSE of 12.54759, ETS sits at 16.69230, so on average the errors from ETS are larger. As we will see later, the ETS model also produces wider prediction intervals, since the data it is trained on yields relatively larger errors.
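The metrics in the table can be computed by hand; a toy example with made-up actual and forecast values:

```r
actual <- c(100, 110, 120)   # made-up "observed" values
fc     <- c( 90, 120, 120)   # made-up point forecasts
e <- actual - fc

rmse <- sqrt(mean(e^2))            # penalises large errors more heavily
mae  <- mean(abs(e))               # average absolute error
mape <- mean(abs(e / actual)) * 100  # scale-free percentage error

c(RMSE = rmse, MAE = mae, MAPE = mape)
```

RMSE squares the errors before averaging, which is why a few large misses widen the gap between the two models more than MAE does.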
Final <- myseries |>
model(ARIMA = ARIMA(box_cox(Turnover, lambda), stepwise = FALSE, approximation = FALSE),
ETS = ETS(Turnover))
Final |>
forecast(h = 24)|>
autoplot(myseries |> select(-`Series ID`))+
labs(title = "Forecast for BOTH ARIMA and ETS", subtitle = "ETS Vs ARIMA", y = "Turnover in $AUD (millions)")
Seeing both ARIMA and ETS together, we can compare the two forecasts directly. The ETS prediction intervals are clearly wider than the ARIMA ones, while the point forecasts remain fairly similar.
ETS_plot <- Final|>
select(c(State, Industry, ETS)) |>
forecast(h = 24)|>
autoplot(myseries , level = 80)+
labs(title = "", subtitle = "ETS Forecasting", y = "Turnover in $AUD (Millions)")
ARIMA_plot <- Final |>
select(c(State, Industry, ARIMA))|>
forecast(h = 24)|>
autoplot(myseries, level = 80)+
labs(title = "forecasts", subtitle = "ARIMA Forecasting", y = "Turnover in $AUD (Millions)")
gridExtra::grid.arrange(ETS_plot,ARIMA_plot)
new_data <- read_abs(series_id = myseries$`Series ID`[1]) |> select(date, value, series_id) |>
mutate(Month = yearmonth(date),
Turnover = value) |>
select(Month, Turnover) |>
filter(Month > max(myseries$Month)) |>
as_tsibble(index=Month)
## Finding URLs for tables corresponding to ABS series IDA3349577J
## Attempting to download files from series IDA3349577J, Retail Trade, Australia
## Extracting data from downloaded spreadsheets
## Tidying data from imported ABS spreadsheets
Final_forecast <- Final |>
forecast(h = 51)
Final_forecast |> select(c(.model, Month, Turnover, .mean)) |> autoplot(new_data) + labs(y = "Turnover in $AUD (Millions)")
Final_forecast |> accuracy(new_data)
## # A tibble: 2 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA Test 22.6 24.7 22.6 15.5 15.5 NaN NaN 0.686
## 2 ETS Test 30.3 33.1 30.4 20.9 20.9 NaN NaN 0.735
According to the table above, with the new data downloaded via read_abs(), the RMSE increases for both models, and a similar story holds for MAE and MAPE. Even on the new data, the ARIMA model remains better, with an RMSE of 24.68043 against 33.13486 for the ETS model. The comparison of the forecasts with the actual figures therefore tells us that ARIMA is the better model for point forecasts, while the autoplot shows that the wider intervals of the ETS model make it more likely to capture the actual data.
From a characteristic point of view, ARIMA models exploit the autocorrelation in the data, while ETS models are built from its trend and seasonality. A disadvantage of ETS is that exponential smoothing lags behind the data; ARIMA, in turn, is not good at modelling outliers and is therefore better suited to data with fewer shocks.
For the data provided, the ETS model did not perform well: given the ACF plot with many significant spikes and the Ljung-Box p-value below 0.05, we know the automatically selected ETS model could not fully capture the structure in the data.
Moreover, many ETS models are special cases of ARIMA models, so when we let the ARIMA() function search for the best parameters, we give it the freedom to do a better job in most cases (those where the automatic ETS and ARIMA yield different models). Thus the automatically chosen ARIMA model is the best model for the data provided.